Import, explain the variables, and cite the source(s)
Student_Attitude_and_Behavior <- read.csv("~/DATA612/project/Student Attitude and Behavior.csv")
cat(" There are", nrow(Student_Attitude_and_Behavior), "rows, and", ncol(Student_Attitude_and_Behavior),"columns in the Student_Attitude_and_Behavior data." )
## There are 235 rows, and 19 columns in the Student_Attitude_and_Behavior data.
name1 = colnames(Student_Attitude_and_Behavior)[1]
name2 = colnames(Student_Attitude_and_Behavior)[2]
names(Student_Attitude_and_Behavior)
## [1] "Certification.Course"
## [2] "Gender"
## [3] "Department"
## [4] "Height.CM."
## [5] "Weight.KG."
## [6] "X10th.Mark"
## [7] "X12th.Mark"
## [8] "college.mark"
## [9] "hobbies"
## [10] "daily.studing.time"
## [11] "prefer.to.study.in"
## [12] "salary.expectation"
## [13] "Do.you.like.your.degree."
## [14] "willingness.to.pursue.a.career.based.on.their.degree"
## [15] "social.medai...video"
## [16] "Travelling.Time"
## [17] "Stress.Level"
## [18] "Financial.Status"
## [19] "part.time.job"
is_tibble(Student_Attitude_and_Behavior)
## [1] FALSE
as_tibble(Student_Attitude_and_Behavior) -> SAB
is_tibble(SAB)
## [1] TRUE
glimpse(SAB)
## Rows: 235
## Columns: 19
## $ Certification.Course <chr> "No", "No", "Yes"…
## $ Gender <chr> "Male", "Female",…
## $ Department <chr> "BCA", "BCA", "BC…
## $ Height.CM. <dbl> 100, 90, 159, 147…
## $ Weight.KG. <dbl> 58, 40, 78, 20, 5…
## $ X10th.Mark <dbl> 79.0, 70.0, 71.0,…
## $ X12th.Mark <dbl> 64.00, 80.00, 61.…
## $ college.mark <dbl> 80, 70, 55, 58, 3…
## $ hobbies <chr> "Video Games", "C…
## $ daily.studing.time <chr> "0 - 30 minute", …
## $ prefer.to.study.in <chr> "Morning", "Morni…
## $ salary.expectation <int> 40000, 15000, 130…
## $ Do.you.like.your.degree. <chr> "No", "Yes", "Yes…
## $ willingness.to.pursue.a.career.based.on.their.degree <chr> "50%", "75%", "50…
## $ social.medai...video <chr> "1.30 - 2 hour", …
## $ Travelling.Time <chr> "30 - 60 minutes"…
## $ Stress.Level <chr> "Bad", "Bad", "Aw…
## $ Financial.Status <chr> "Bad", "Bad", "Ba…
## $ part.time.job <chr> "No", "No", "No",…
sum(is.na(SAB))
## [1] 0
summary(SAB)
## Certification.Course Gender Department Height.CM.
## Length:235 Length:235 Length:235 Min. : 4.5
## Class :character Class :character Class :character 1st Qu.:152.0
## Mode :character Mode :character Mode :character Median :160.0
## Mean :157.4
## 3rd Qu.:170.0
## Max. :192.0
## Weight.KG. X10th.Mark X12th.Mark college.mark
## Min. : 20.0 Min. : 7.40 Min. :45.00 Min. : 1.00
## 1st Qu.: 50.0 1st Qu.:70.00 1st Qu.:60.00 1st Qu.: 60.00
## Median : 60.0 Median :80.00 Median :69.00 Median : 70.00
## Mean : 60.8 Mean :76.85 Mean :68.78 Mean : 70.66
## 3rd Qu.: 70.0 3rd Qu.:86.25 3rd Qu.:76.00 3rd Qu.: 80.00
## Max. :106.0 Max. :98.00 Max. :94.00 Max. :100.00
## hobbies daily.studing.time prefer.to.study.in salary.expectation
## Length:235 Length:235 Length:235 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 15000
## Mode :character Mode :character Mode :character Median : 20000
## Mean : 32482
## 3rd Qu.: 25000
## Max. :1500000
## Do.you.like.your.degree. willingness.to.pursue.a.career.based.on.their.degree
## Length:235 Length:235
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## social.medai...video Travelling.Time Stress.Level Financial.Status
## Length:235 Length:235 Length:235 Length:235
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## part.time.job
## Length:235
## Class :character
## Mode :character
##
##
##
We rename the two longest variable names and shorten the department labels here.
SAB <- SAB %>%
rename(degree_prefer =Do.you.like.your.degree.) %>%
rename(career_willingness = willingness.to.pursue.a.career.based.on.their.degree)%>%
mutate(Department = gsub("B.com ISM", "ISM", Department))%>%
mutate(Department = gsub("B.com Accounting and Finance ", "Accounting and Finance", Department))
datatable(SAB, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )
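In `gsub()` the `.` in a pattern such as "B.com ISM" is a regular-expression wildcard, so it matches any character in that position. A minimal sketch of a literal-match alternative is shown here, using a hypothetical `dept` vector (not the real `Department` column) and the assumption that stripping the "B.com " prefix is the goal:

```r
# Hypothetical department labels, for illustration only
dept <- c("B.com ISM", "B.com Accounting and Finance ", "BCA", "Commerce")

# fixed = TRUE treats the pattern as a literal string, so "." matches only "."
dept_clean <- gsub("B.com ", "", dept, fixed = TRUE)

# trimws() removes the trailing space left on the accounting label
dept_clean <- trimws(dept_clean)
print(dept_clean)
```

Labels without the prefix pass through unchanged, so the same call can be applied to the whole column.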
Based on the dataset, we can see that males outnumber females. Also, we are interested in the distribution of gender across other variables.
# Keep only the gender bar plot; the other bar plots were removed
bar1<- ggplot(SAB, aes(x= Gender))+
geom_bar(fill="skyblue")+
labs(title ="Gender Distribution")
bar1
Below is the distribution of students’ time spent on social media and travelling time.
bar4<- ggplot(SAB, aes(x= social.medai...video))+
geom_bar(fill="skyblue")+
theme(axis.text.x= element_text(angle = 45, hjust = 1))
bar5<- ggplot(SAB, aes(x= Travelling.Time))+
geom_bar(fill="skyblue")+
theme(axis.text.x= element_text(angle = 45, hjust = 1))
bar4+bar5
We create plots that compare gender across departments and certification status.
# Add percentage labels (share of each gender within a department)
SAB %>%
count(Department, Gender) %>%
group_by(Department) %>%
mutate(perc = n / sum(n) * 100) %>%
ggplot(aes(x = Department, y = n, fill = Gender, label = sprintf("%.1f%%", perc))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(position = position_dodge(width = 0.9), vjust = -0.25, size = 3) +
labs(title = "Department vs Gender", x = "Department", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on the distribution, we found that most of the students completed a certification course.
# We recreate the bar plot by changing x and y
ggplot(SAB, aes(x = Gender, fill = Certification.Course)) +
geom_bar(position = "dodge") +
labs(title = "Certification.Course vs Gender", x = "Gender", y = "Count", fill = "Certification.Course") +
theme_minimal()
Here is the stress level by gender; most of the students report a good stress level.
# Add percentage labels (share of each stress level within a gender)
stress_gender <- SAB %>%
group_by(Stress.Level, Gender) %>%
summarise(count = n(), .groups = 'drop') %>%
group_by(Gender) %>%
mutate(percentage = (count / sum(count)) * 100)
ggplot(stress_gender, aes(x = Gender, y = count, fill = Stress.Level)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = sprintf("%.1f%%", percentage)),
position = position_dodge(width = 0.9), vjust = -0.5, size = 2.5) +
labs(title = "Gender vs Stress", x = "Gender", y = "Count") +
theme_minimal()
Here is the breakdown of hobbies and study time. Most students study for 30 to 60 minutes a day. In this sample, no student whose hobby is video games reported studying for more than four hours a day.
# We add the daily.studing.time bar here
hobby_studytime<- SAB%>%
group_by(hobbies, daily.studing.time) %>%
summarise(count = n())
## `summarise()` has grouped output by 'hobbies'. You can override using the
## `.groups` argument.
study1<- ggplot(hobby_studytime, aes(x = hobbies, y = count, fill = daily.studing.time)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "hobby v.s. studytime", x = "hobbies", y = "Count") +
theme_minimal()+
theme(axis.text.x= element_text(angle = 45, hjust = 1))
study2<-ggplot(SAB, aes(x=daily.studing.time ))+
geom_bar(fill= "skyblue")+
theme(axis.text.x= element_text(angle = 45, hjust = 1))
study1+study2
SAB2 <- SAB
SAB2$Gender <- as.numeric(as.factor(SAB2$Gender))
numeric_data <- SAB2[, sapply(SAB2, is.numeric)]
cor_matrix <- cor(numeric_data, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45,
addCoef.col = "black")
college.mark and X10th.Mark (0.47): There’s a moderate positive correlation between marks in the 10th grade and college, suggesting that higher marks in the 10th grade are associated with higher college marks.
college.mark and X12th.Mark (0.42): This indicates a moderate positive correlation between 12th-grade marks and college marks.
college.mark and other variables: The correlation coefficients with other variables like salary expectation, height, gender, and weight are close to zero, indicating a very weak to no linear relationship with college marks.
X10th.Mark and X12th.Mark (0.47): This shows a moderate positive correlation, indicating that students who perform well in the 10th grade also tend to perform well in the 12th grade.
Height.CM. and Weight.KG. (0.28): A positive correlation here suggests that as height increases, weight also tends to increase, which is a common physiological correlation.
Gender and Weight.KG. (0.49): This suggests a moderate positive correlation. Because as.factor() orders levels alphabetically, Female is coded 1 and Male is coded 2, so the positive sign means that male students tend to weigh more on average.
Other Correlations: Correlations involving salary expectation, height, and gender with 10th and 12th-grade marks are weak, as indicated by coefficients closer to zero.
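The sign of any correlation involving Gender depends on how the factor is turned into numbers. The sketch below uses made-up `gender` and `weight` vectors (not the student data) to show the coding that `as.numeric(as.factor())` produces and how it determines the sign:

```r
# Hypothetical gender and weight values, for illustration only
gender <- c("Male", "Female", "Female", "Male", "Male", "Female")
weight <- c(72, 55, 58, 80, 68, 50)

# as.factor() sorts levels alphabetically: "Female" = 1, "Male" = 2
gender_code <- as.numeric(as.factor(gender))
print(levels(as.factor(gender)))
print(gender_code)

# A positive correlation means the level coded 2 (Male) has the larger weights
r <- cor(gender_code, weight)
print(r)
```

Reversing the level order would flip the sign of the coefficient without changing its magnitude.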
SAB$daily.studing.time <- factor(SAB$daily.studing.time,
levels = c("0 - 30 minute", "30 - 60 minute",
"1 - 2 Hour", "2 - 3 hour",
"3 - 4 hour", "More Than 4 hour"))
ggplot(SAB, aes(x = daily.studing.time, y = college.mark, fill=daily.studing.time)) +
geom_boxplot() +
labs(title = "College Marks by Daily Studying Time", x = "Daily Studying Time", y = "College Marks")
Students who studied 0 to 30 minutes a day and those who studied 30 to 60 minutes performed similarly.
The 1 - 2 hour box is taller in comparison, indicating a greater spread of marks among these students.
3 - 4 hour: students in this category have a slightly higher median mark than the previous groups.
More than 4 hour: the median mark of students who study more than 4 hours is relatively low, so more study time does not necessarily bring higher marks.
This suggests that the quality of study time may matter more than its quantity, although the data set does not measure study quality directly.
lm_studytime <- lm(college.mark ~ daily.studing.time, SAB)
summary(lm_studytime)
##
## Call:
## lm(formula = college.mark ~ daily.studing.time, data = SAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.32 -8.69 1.31 10.14 29.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.6904 2.3171 29.645 <2e-16 ***
## daily.studing.time30 - 60 minute 1.1738 2.9013 0.405 0.6862
## daily.studing.time1 - 2 Hour 1.6271 3.0688 0.530 0.5965
## daily.studing.time2 - 3 hour 7.2387 3.9572 1.829 0.0687 .
## daily.studing.time3 - 4 hour 6.8296 4.6726 1.462 0.1452
## daily.studing.timeMore Than 4 hour -0.9404 6.0199 -0.156 0.8760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.72 on 229 degrees of freedom
## Multiple R-squared: 0.0229, Adjusted R-squared: 0.001563
## F-statistic: 1.073 on 5 and 229 DF, p-value: 0.3759
The median of the residuals is 1.31, which is close to 0, suggesting that the residuals are roughly centered. The minimum residual of -69.32, however, points to a few students whose marks fall far below the fitted value.
The p-values for all study-time coefficients are above 0.05, so no study-time category differs significantly from the baseline "0 - 30 minute" group. The overall F-test (p-value 0.3759) agrees that study time is not a statistically significant predictor of college marks.
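With a single categorical predictor, each `lm()` coefficient is a difference in group means relative to the first factor level. A small sketch with invented marks and a hypothetical three-level `study` factor illustrates this reading of the table:

```r
# Hypothetical marks for three study-time groups, for illustration only
study <- factor(c("low", "low", "mid", "mid", "high", "high"),
                levels = c("low", "mid", "high"))
mark <- c(60, 64, 66, 70, 59, 63)

fit <- lm(mark ~ study)
group_means <- tapply(mark, study, mean)

# Intercept = mean of the baseline level; other coefficients = mean differences
print(coef(fit))
print(group_means - group_means["low"])
```

The same logic applies to the output above: the intercept is the mean college mark of the "0 - 30 minute" group, and each remaining estimate is that group's mean minus the baseline mean.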
In summary, the linear model indicates that there is no strong
evidence of a relationship between college.mark and
daily.studing.time, given the lack of statistical
significance for the coefficients and the low R-squared values. The most
substantial association seen is with the “2 - 3 hour” study group, which
might suggest a slight increase in college marks, but this is not
statistically significant at the 5% level.
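As a rough check on the reported values, adjusted R-squared can be recomputed by hand from the multiple R-squared, the number of observations \(n = 235\), and the number of predictors \(p = 5\) (the five study-time dummy variables). The symbols \(n\) and \(p\) are introduced here only for this illustration:

\[
R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.0229)\,\frac{234}{229} \approx 0.0016
\]

This agrees with the printed 0.001563 up to rounding of \(R^2\). The formula also shows that adjusted R-squared becomes negative whenever \(R^2 < p/(n-1)\), that is, when the predictors explain less variance than would be expected by chance.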
plota <- ggplot(SAB, aes(x = X12th.Mark, y = college.mark, color = Gender, shape= Gender)) +
geom_jitter(alpha = 1/2, size = 1)+
geom_smooth(method = "lm", se = FALSE)+
labs(title = "12th grade vs. college.mark among Gender")
ggplotly(plota)
## `geom_smooth()` using formula = 'y ~ x'
plotb <- ggplot(SAB, aes(x = X10th.Mark, y = college.mark, color = Gender, shape= Gender)) +
geom_jitter(alpha = 1/2, size = 1)+
geom_smooth(method = "lm", se = FALSE)+
labs(title = "10th grade vs. college.mark among Gender")
ggplotly(plotb)
## `geom_smooth()` using formula = 'y ~ x'
gradeplot1<- ggplot(SAB, aes(x=X10th.Mark, y=college.mark))+
geom_point(color="#a6bddb")+
geom_smooth(method= "lm")+
labs(title = "10th grade vs. college.mark")
gradeplot2<- ggplot(SAB, aes(x=X12th.Mark, y=college.mark))+
geom_point(color="#a6bddb")+
geom_smooth(method= "lm")+
labs(title = "12th grade vs. college.mark")
gradeplot1+gradeplot2
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
lm_score<- lm(college.mark~ X10th.Mark+ X12th.Mark+ Gender, SAB)
tidy(lm_score, conf.int = T)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 21.6 6.58 3.28 0.00118 8.65 34.6
## 2 X10th.Mark 0.403 0.0749 5.39 0.000000177 0.256 0.551
## 3 X12th.Mark 0.326 0.0897 3.64 0.000339 0.150 0.503
## 4 GenderMale -6.62 1.85 -3.57 0.000427 -10.3 -2.97
\(Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3}+ \epsilon_i\), where
\(Y_i\) is the college mark of student i,
\(X_{i1}\) is the 10th-grade mark of student i,
\(X_{i2}\) is the 12th-grade mark of student i,
\(X_{i3}\) is an indicator equal to 1 if student i is male and 0 otherwise.
The errors \(\epsilon_i\) are assumed to have mean 0, constant variance, and to be uncorrelated.
The output shows that female students perform better academically than male students. Among students with the same 10th-grade and 12th-grade marks, female students score about 6.6 points higher on average (95% confidence interval 2.97 to 10.3).
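Plugging the rounded estimates from the `tidy()` output into the model gives the fitted equation, with \(X_{i1}\) the 10th-grade mark, \(X_{i2}\) the 12th-grade mark, and \(X_{i3}\) the male indicator:

\[
\hat{Y}_i = 21.6 + 0.403\,X_{i1} + 0.326\,X_{i2} - 6.62\,X_{i3}
\]

For two students with identical \(X_{i1}\) and \(X_{i2}\), the first three terms cancel, so the predicted male-minus-female difference is just the gender coefficient:

\[
\hat{Y}_{male} - \hat{Y}_{female} = \hat{\beta}_3 = -6.62
\]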
lm_score11<- lm(college.mark~ X10th.Mark+ X12th.Mark+ part.time.job, SAB)
tidy(lm_score11, conf.int = T)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.0 6.32 2.05 0.0414 0.507 25.4
## 2 X10th.Mark 0.412 0.0768 5.36 0.000000199 0.261 0.563
## 3 X12th.Mark 0.375 0.0910 4.12 0.0000528 0.196 0.554
## 4 part.time.jobYes 1.63 2.32 0.700 0.484 -2.95 6.20
lm_score2<- lm(college.mark~ X10th.Mark+X12th.Mark, SAB)
tidy(lm_score2, conf.int = T)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.2 6.30 2.09 0.0373 0.785 25.6
## 2 X10th.Mark 0.411 0.0767 5.36 0.000000199 0.260 0.562
## 3 X12th.Mark 0.376 0.0908 4.14 0.0000490 0.197 0.555
aout <- augment(lm_score2)
ggplot(data = aout, mapping = aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The residuals appear to be centered at 0 across the range of fitted values, so a linear model seems appropriate.
We need to check for normality to verify that a prediction interval is valid.
ggplot(data = aout, mapping = aes(sample = .resid)) +
geom_qq() +
geom_qq_line(color="blue")
This plot evaluates the normality of the error terms. Although the residuals
have a long left tail, the fit looks acceptable overall.
range(SAB$X10th.Mark)
## [1] 7.4 98.0
range(SAB$X12th.Mark)
## [1] 45 94
df1 <- data.frame(X10th.Mark = c(70, 80), X12th.Mark= c(34, 87))
predict(lm_score2, newdata = df1, interval = "confidence")
## fit lwr upr
## 1 54.7717 48.71221 60.83119
## 2 78.8071 75.28575 82.32845
Based on the output, students who score 70 in the 10th grade and 34 in the 12th grade are predicted to average about 55 in college (95% confidence interval 49 to 61). Note that 34 lies below the observed 12th-grade range of 45 to 94, so this first prediction is an extrapolation and should be treated with caution. Students who score 80 in the 10th grade and 87 in the 12th grade are predicted to average about 79 in college (95% confidence interval 75 to 82).
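The `predict()` call above requests `interval = "confidence"`, which bounds the mean college mark of all students with the given marks. An interval for one individual student would use `interval = "prediction"` and is always wider. The following sketch contrasts the two on simulated data; the variables `x` and `y` and the value 70 are assumptions for illustration, not the student data:

```r
# Simulated data, for illustration only (not the student data set)
set.seed(1)
x <- runif(50, 40, 100)
y <- 10 + 0.8 * x + rnorm(50, sd = 8)
fit <- lm(y ~ x)

new_x <- data.frame(x = 70)

# Interval for the mean response at x = 70
conf_int <- predict(fit, newdata = new_x, interval = "confidence")

# Interval for a single new observation at x = 70
pred_int <- predict(fit, newdata = new_x, interval = "prediction")

print(conf_int)
print(pred_int)
```

Both calls return the same fitted value; only the lower and upper bounds differ, because the prediction interval adds the residual variance of an individual observation.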
SAB_degree<-SAB%>%
group_by(degree_prefer)%>%
count()%>%
mutate(frequency = n / 235)
SAB_degree2<-SAB%>%
group_by(degree_prefer, Department)%>%
count()%>%
mutate(frequency = n / 235)
SAB_degree
## # A tibble: 2 × 3
## # Groups: degree_prefer [2]
## degree_prefer n frequency
## <chr> <int> <dbl>
## 1 No 20 0.0851
## 2 Yes 215 0.915
SAB_degree2
## # A tibble: 8 × 4
## # Groups: degree_prefer, Department [8]
## degree_prefer Department n frequency
## <chr> <chr> <int> <dbl>
## 1 No Accounting and Finance 1 0.00426
## 2 No BCA 16 0.0681
## 3 No Commerce 1 0.00426
## 4 No ISM 2 0.00851
## 5 Yes Accounting and Finance 14 0.0596
## 6 Yes BCA 116 0.494
## 7 Yes Commerce 59 0.251
## 8 Yes ISM 26 0.111
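About 91.5% of students like their degree. Students in the BCA program who like their degree account for 49% of the whole sample, largely because BCA is the biggest department: within BCA, 116 of 132 students (about 88%) like their degree.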
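The `frequency` column divides every count by the full sample of 235, so it mixes department size with degree preference. One way to separate the two is to compute the share within each department. The sketch below retypes the counts from the `SAB_degree2` output into a matrix; the object names `counts` and `within_dept` are introduced here for illustration:

```r
# Counts copied from the SAB_degree2 output
counts <- matrix(c(1, 16, 1, 2,
                   14, 116, 59, 26),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(degree_prefer = c("No", "Yes"),
                                 Department = c("Accounting and Finance", "BCA",
                                                "Commerce", "ISM")))

# margin = 2 divides each column by its total: share within each department
within_dept <- prop.table(counts, margin = 2)
print(round(within_dept["Yes", ], 3))
```

Read this way, BCA has the lowest share of students who like their degree (about 88%) and Commerce the highest (about 98%), even though BCA contributes the most "Yes" answers in absolute terms.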
t.test(SAB$salary.expectation ~ SAB$Certification.Course)
##
## Welch Two Sample t-test
##
## data: SAB$salary.expectation by SAB$Certification.Course
## t = -1.627, df = 156.06, p-value = 0.1058
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -40599.500 3925.622
## sample estimates:
## mean in group No mean in group Yes
## 20621.19 38958.13
Since the p-value (0.1058) is greater than 0.05, we fail to reject the null hypothesis. There is not enough evidence that there is a significant difference in mean salary expectations between students who completed certification courses and those who did not.
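The Welch test used above does not assume equal variances in the two groups, which matters when one group contains extreme values. A minimal sketch on simulated salaries (the vectors `no_cert` and `yes_cert` are invented, not taken from the data) shows how the statistic is formed and that it matches `t.test()`:

```r
# Simulated salary expectations for two groups, for illustration only
set.seed(42)
no_cert  <- rnorm(40, mean = 20000, sd = 5000)
yes_cert <- rnorm(60, mean = 23000, sd = 15000)

# Welch statistic: difference in means over the unpooled standard error
se <- sqrt(var(no_cert) / length(no_cert) + var(yes_cert) / length(yes_cert))
t_manual <- (mean(no_cert) - mean(yes_cert)) / se

# var.equal = FALSE is the default, so this is the Welch test
res <- t.test(no_cert, yes_cert)
print(c(manual = t_manual, t.test = unname(res$statistic)))
```

A group with a large variance inflates the standard error, which is one reason a sizeable gap in sample means can still fail to reach significance.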
SAB%>%
group_by(Gender , Certification.Course) %>%
summarise(MeanSalary = mean(salary.expectation))
## `summarise()` has grouped output by 'Gender'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: Gender [2]
## Gender Certification.Course MeanSalary
## <chr> <chr> <dbl>
## 1 Female No 16876.
## 2 Female Yes 47455.
## 3 Male No 22145.
## 4 Male Yes 34140.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(GGally)
library(broom)
library(gridExtra)
library(DT)
library(patchwork)
library(ggthemes)
library(plotly)
library(gapminder)
library(corrplot)
library(car)
SAB$Financial.Status <- as.factor(SAB$Financial.Status)
SAB$Stress.Level <- as.factor(SAB$Stress.Level)
SAB$college.mark <- as.numeric(SAB$college.mark)
ggplot(SAB, aes(x = Financial.Status, y = college.mark, fill = Financial.Status)) +
geom_boxplot() +
labs(title = "College Marks by Financial Status", x = "Financial Status", y = "College Marks")
SAB$Financial.Status <- factor(SAB$Financial.Status)
lm_financial <- lm(college.mark ~ Financial.Status , SAB)
summary(lm_financial)
stress_financial<- SAB%>%
group_by(Stress.Level, Financial.Status) %>%
summarise(count = n())
ggplot(stress_financial, aes(x = Financial.Status, y = count, fill = Stress.Level)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Financial vs Stress", x = "Financial", y = "Count") +
theme_minimal()
SAB$Financial.Status <- as.factor(SAB$Financial.Status)
SAB$Stress.Level <- as.factor(SAB$Stress.Level)
lm_stress_finance <- lm(college.mark ~Stress.Level, SAB)
summary(lm_stress_finance)
SAB$Financial.Status <- as.factor(SAB$Financial.Status)
SAB$Stress.Level <- as.factor(SAB$Stress.Level)
lm_stress_finance <- lm(college.mark ~ Financial.Status + Stress.Level, SAB)
summary(lm_stress_finance)
lm_stress_finance_gender <- lm(college.mark ~ Financial.Status + Stress.Level+ Gender, SAB)
summary(lm_stress_finance_gender)
5 Social and Economic Factors Influencing Student Life
5.1 Plotting college marks by Financial.Status
What is the impact of financial status on students’ stress levels and academic performance?
The “Fabulous” financial status group has the lowest median college mark, while the “Awful” group does not.
This could suggest that financial status may not be the primary determinant of academic success or that there are other factors at play.
5.1.1 Regression Analysis for Financial Status and College Marks
The model’s results suggest that there is no significant relationship between Financial.Status and college.mark.
Given the lack of statistical significance and the very low R-squared values, Financial.Status alone does not appear to be a good predictor of college.mark in this model. This does not prove that financial status has no effect on academic performance, only that this model finds no evidence of one.
5.2 Plotting college marks by Stress.Level
5.2.1 Regression Analysis for Stress Level and College Marks
The results of this model also show no significant relationship between Stress.Level and college.mark, because the p-values for the different stress levels are all greater than 0.05.
5.3 Multiple Regression Analysis (Financial.Status + Stress.Level)
The regression analysis indicates that neither Financial.Status nor Stress.Level are significant predictors of college.mark in the context of this model. The adjusted R-squared value being negative is a particularly strong indication that the model has no predictive power.
5.3.1 Add Gender as predictor
The coefficient for males is -9.3780 with p < 0.05, so gender appears to be an important predictor in this model: male students are expected to score about 9.4 points lower than female students with the same financial status and stress level.